
chore: rename deprecated orchestrator config keys #2327

Merged
mikasenghaas merged 1 commit into main from chore/rename-deprecated-config-keys on Apr 19, 2026

Conversation

mikasenghaas (Member) commented on Apr 19, 2026

Summary

  • Rename [orchestrator.sampling] → [orchestrator.train.sampling] and [[orchestrator.env]] → [[orchestrator.train.env]] across all configs
  • Rename max_tokens → max_completion_tokens in sampling sections
  • Drop reliance on the deprecated auto-translation emitted by the orchestrator config validator
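
A before/after sketch of the rename in TOML (section contents are illustrative, not copied from a specific config):

    # Before (deprecated keys, auto-translated by the config validator)
    [orchestrator.sampling]
    max_tokens = 4096

    [[orchestrator.env]]
    id = "example-env"

    # After
    [orchestrator.train.sampling]
    max_completion_tokens = 4096

    [[orchestrator.train.env]]
    id = "example-env"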

Validation

  • Ran uv run rl @ <config> --dry-run on all 38 modified RL configs — no deprecation warnings
  • Validated the two orchestrator-only partial configs (configs/debug/orch.toml, configs/ci/integration/rl_multi_run/orchestrator.toml) via direct OrchestratorConfig.model_validate — no deprecation warnings

🤖 Generated with Claude Code


Note

Low risk: this is a mechanical rename of TOML config keys/fields to match the current orchestrator schema, with no functional code changes. The main risk is mis-typed keys causing configs to be ignored or validation to fail at runtime.

Overview
Updates training configs across configs/ and examples/ to stop using deprecated orchestrator keys.

Specifically renames [orchestrator.sampling] to [orchestrator.train.sampling], [[orchestrator.env]] to [[orchestrator.train.env]] (and similarly for orchestrator-only partial configs), and replaces max_tokens with max_completion_tokens in sampling sections.

Reviewed by Cursor Bugbot for commit 6ce2707.

Rename '[orchestrator.sampling]' -> '[orchestrator.train.sampling]',
'[[orchestrator.env]]' -> '[[orchestrator.train.env]]', and
'max_tokens' -> 'max_completion_tokens' across all configs to remove
reliance on the deprecated auto-translation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas requested a review from samsja April 19, 2026 16:10
@mikasenghaas mikasenghaas marked this pull request as ready for review April 19, 2026 16:10
@mikasenghaas mikasenghaas merged commit d2718f5 into main Apr 19, 2026
17 of 18 checks passed
joanvelja added a commit to joanvelja/prime-rl that referenced this pull request Apr 20, 2026
…nt (#1)

* feat(orchestrator): multi-actor debate env integration + tests

Wire prime-rl's orchestrator to the multi-actor debate environment
(forks/verifiers/verifiers/envs/debate*). Adds the orchestrator-side
glue and the unit-test suite for the debate env's W/G/M scoring path.

Source modules (src/prime_rl/orchestrator/):
- multi_actor.py: orchestrator dispatch for multi-actor episodes
- multi_actor_advantage.py: GRPO/RAE advantage computation across
  per-actor rewards, handles role-conditioned advantage attribution
- multi_actor_bridge.py: trajectory ↔ training-batch bridge with
  two-table output (one row per actor step), no flattening
- multi_actor_eval.py: eval-mode scaffolding for multi-actor rollouts
- eval_utils.py: small adjustments to thread multi-actor state through
  the eval loop
- vf_utils.py: small adjustments to surface the new env factory params
  (judge_client, judge_model, judge_max_retries, etc.) to verifiers
  load_environment
- .gitignore: ignore .DS_Store noise

Tests (tests/unit/orchestrator/):
- test_debate_env.py: 216-test coverage of DebateEnv rollout, W/G/M
  scoring, F2 short-circuit, state['error'] capture via maybe_retry,
  composed JudgeRubric grader+matcher, latest-step authority, MCQ
  fast path, judge wrap_opponent viewer_role threading, verdict
  collision validation, metrics/error_info split
- test_debate_fields.py: field extraction + scoring mode coverage
- test_debate_prompts.py: prompt rendering + opponent_wrap viewer_role
  + judge template loading
- test_multi_actor.py / test_multi_actor_bridge.py /
  test_multi_actor_e2e.py / test_multi_actor_eval.py: foundation
  multi-actor protocol coverage

Critical regression guard:
test_debate_env.test_score_rollout_captures_vf_error_from_grader —
verifies vf.InvalidModelResponseError from a composed grader_rubric
flows through _grade → _score_rollout_body → score_rollout's
except vf.Error → state['error'] (for maybe_retry retry discovery)
+ state['metrics']['errored_rollout']=1.0 + state['error_info']
{error_type, error_phase}. Single backend call (no implicit retry at
score_rollout level; retry layered correctly at run_group_attempt).

Suite: 216 orchestrator tests + 3 fork-internal JudgeRubric tests
= 219 passing, 0 failing.

* fix 2.5 -> qwen (#2286)

* fix 32 -> 30 (#2287)

* feat: set tool_call_parser default to 'auto' (#2285)

* feat: set tool_call_parser default to 'auto'

Changed the default value of tool_call_parser from None to 'auto' to enable automatic tool-call parser detection from the model name. This provides a better out-of-the-box experience for users working with tool-calling models.

* test: add unit tests for inference metrics collector

Tests parsing, aggregation (sum/max/mean), counter rates, histogram
latency, counter reset handling, server failures, and wandb logging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Revert "test: add unit tests for inference metrics collector"

This reverts commit 48eb049144b51ec9a2562358b398c2be46bc8eca.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: pre-download model weights in launcher (#2282)

* feat: pre-download model weights in launcher instead of using HF_HUB_OFFLINE

Remove hardcoded `HF_HUB_OFFLINE=1` from multi-node SLURM templates and
instead pre-download model weights via `snapshot_download` in the rl/sft
launchers before dispatching to local or SLURM execution. This ensures
weights are cached on the shared filesystem before training starts,
removing the need to manually pre-download models.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: pre-download model weights in launcher instead of using HF_HUB_OFFLINE

Remove hardcoded `HF_HUB_OFFLINE=1` from multi-node SLURM templates and
instead pre-download model weights via `snapshot_download` in the rl/sft
launchers before dispatching to local or SLURM execution. This ensures
weights are cached on the shared filesystem before training starts,
removing the need to manually pre-download models.

Also replace `format_time` with the verifiers-style two-unit display
(e.g. "1h 30m" instead of "1.50h").

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
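
A minimal sketch of the pre-download step, assuming the helper simply wraps huggingface_hub's snapshot_download (launcher wiring omitted):

    from huggingface_hub import snapshot_download

    def pre_download_model(model_name: str) -> str:
        # Blocks until all weight files are in the local HF cache (the
        # shared filesystem under SLURM) and returns the snapshot path.
        return snapshot_download(repo_id=model_name)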

* refactor: move pre_download_model to trainer.model

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: log cache path when model is already downloaded

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove redundant cache log from pre_download_model

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: move pre_download_model import to module top

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: skip download and log cache path when model already cached

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Revert "feat: skip download and log cache path when model already cached"

This reverts commit 5b2bac9f2bb9e22ae898fe54f66f48357931dc40.

* chore: keep HF_HUB_OFFLINE=1 in SLURM templates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: context parallelism for NemotronH Mamba layers (#2231)

* refactor(tests): relocate verifiers fork to sibling path, delete stub scaffolding

Move forks/verifiers/ to ../verifiers/ and switch pyproject to an editable
sibling install. Delete ~690 LOC of sys.path-injection + ModuleType stub
scaffolding from 7 orchestrator test files; tests now use normal Python
imports matching upstream verifiers conventions.

- pyproject.toml: verifiers source git-pin → editable path "../verifiers"
  with inline doc explaining sibling clone requirement
- _compat.py: try/except ImportError → importlib.util.find_spec guard
  (partial/broken transformers installs in training contexts still fail loud;
  cleanly-absent transformers in the fork venv takes the skip path)
- test_debate_env.py: FakeClient promoted to real vf.Client subclass;
  retry-loop tests use real maybe_retry + monkeypatched wait_none; dead
  _reraise_error_from_state helper + stale "module-level stub" comments
  deleted; _VFResponse/_VFUsage/_VFResponseMessage aliases dropped
- test_debate_prompts.py: _PROMPTS_DIR via importlib.resources (namespace-
  package & wheel-safe)
- Run-command docs added to test_debate_env.py docstring (cross-linked
  from fields/prompts docstrings)

Tests still require --noconftest because prime-rl's root conftest eagerly
imports prime_rl.trainer.world (torch/distributed). Orthogonal, out of scope.

Run (from fork venv):
  cd ../verifiers && uv run pytest \
    /path/to/prime-rl/tests/unit/orchestrator/test_*.py --noconftest

* Support runtime verifiers version override (#2274)

* Support runtime verifiers version override via VERIFIERS_VERSION env var

When set, the entrypoint reinstalls verifiers from the specified git ref
(tag, branch, or commit) before starting the main process.

* Drop --no-deps so transitive deps are updated with verifiers override

* Use --reinstall-package to only reinstall verifiers, not the entire dep tree

* fix: always ensure X-Session-ID and propagate extra_headers_from_state in elastic pool (#2283)

Two fixes:
- Use setdefault so X-Session-ID: example_id is always present for
  sticky DP-aware routing, even if user provides other
  extra_headers_from_state entries
- Propagate extra_headers_from_state when rebuilding clients in the
  elastic pool, so session headers survive pool refreshes

Keeps dp_rank_count as-is for direct DP rank routing.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
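
A minimal sketch of the setdefault fix, with assumed names for the surrounding variables:

    headers = dict(extra_headers_from_state or {})
    # setdefault keeps any user-provided entries while guaranteeing the
    # session header used for sticky DP-aware routing is always present.
    headers.setdefault("X-Session-ID", str(example_id))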

* Feat: fix cpu offloading patch to match upstream and remove a segfault (#2300)

* test(debate_env): explicit members required at construction

Exercises the new DebateEnv contract: empty/duplicate members raise,
and len(self.members) replaces _count_actors as the round-index divisor.

* fix(bridge): widen MemberRollout.example_id to int | str

EpisodeResult.base_example_id is typed int | str upstream, but the
bridge enforced int via _validated_example_id and TypedDict. Widen
MemberRollout.example_id to int | str and drop the int coercion
(keep the None check).

* test(kernel): assert KernelProtocolError is raised (and is a vf.Error)

Cover all three apply_action protocol-violation branches: wrong actor,
duplicate submission, and post-finished submission.

* fix(bridge): revert int|str widening — dataset and buffer still require int

Gatekeeper (HIGH): widening MemberRollout.example_id to int | str was a
local lie. verifiers.envs.environment._ensure_example_id coerces dataset
rows to int and prime_rl.orchestrator.buffer.Buffer keys its example
store by int. The first str id propagated through the bridge would blow
up non-locally at buffer-insert with a confusing stack trace.

Revert to int-only here and fail loud with a message pointing at the
two downstream layers. Full int | str propagation (dataset + buffer +
bridge together, with an integration test) is deferred to a follow-up.

* test(debate_env): cross-checks for members drift + cosmetic cleanup

Regression tests for the two cross-checks added in verifiers@d7ab4fb:
- test_debate_env_members_must_match_rubric_members (order-sensitive)
- test_debate_env_members_must_match_static_schedule_actors
- test_debate_env_skips_schedule_cross_check_for_dynamic_program

Also addresses auditor cosmetics:
- hoist KernelProtocolError / vf.Error imports to module top
- update stale docstring on test_kernel_rejects_wrong_actor

* test(orchestrator): migrate debate tests to channel-split Utterance

- Update Utterance fixtures to use raw_content/public_channel/private_channel.
- Replace strip_think/redact_think contract tests with parse_channels contract
  (hard-fail on unclosed, stray, multiple, nested).
- Replace unclosed-think privacy integration test with public_channel viewer
  check — leakage is now structurally impossible.
- Add apply_action malformed-markup rejection test.
- Fix attribution test schedule (add judge slot for members=[A,B,J]).
- Remove test_mcq_think_tag_stripped — think handling no longer lives in mcq.

* accept fully-qualified expert names in lora check (#2301)

* accept fully-qualified expert names in lora check

* ruff format

* refactor(bridge): dual-read member rewards (structured -> flat fallback)

Prefer state['member_rewards'][mid] (MultiAgentRubric contract).
Fall back to legacy flat metrics['reward/{mid}'] with one-time
deprecation warning per process. Structured key wins when both
present.
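
A minimal sketch of the dual-read (state/metrics shapes assumed):

    import warnings

    _warned_flat_rewards = False  # one-time deprecation warning per process

    def resolve_member_reward(state: dict, metrics: dict, mid: str):
        global _warned_flat_rewards
        structured = state.get("member_rewards")
        if structured is not None and mid in structured:
            return structured[mid]           # structured key wins when both present
        if not _warned_flat_rewards:
            warnings.warn("flat 'reward/<mid>' metric keys are deprecated",
                          DeprecationWarning)
            _warned_flat_rewards = True
        return metrics.get(f"reward/{mid}")  # legacy flat fallback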

* refactor(advantage): extend RAE baseline key to (task, example_id, role_id)

Partitions EMA baselines across envs — previously, two envs with
overlapping example_ids would contaminate each other's role-conditioned
baselines. 'task' sourced from MemberRollout['task'] (= env name).
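
A minimal sketch of the widened key (types assumed from the surrounding commits):

    # EMA baselines partitioned per env ("task"), example, and role, so two
    # envs with overlapping example_ids can no longer contaminate each other.
    RAEKey = tuple[str, int, str]  # (task, example_id, role_id)

    def rae_key(rollout: dict) -> RAEKey:
        return (rollout["task"], rollout["example_id"], rollout["role_id"])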

* test(rubrics): MultiAgentRubric contract + bridge dual-read + RAE task key

- contract: subclass populates member_rewards/member_metrics/episode_metrics
- score_group error boundary: KernelProtocolError in one rollout does not
  prevent scoring of other rollouts; defaults populated on failing state
- non-vf errors propagate (programming bugs escape loud)
- bridge prefers structured member_rewards, falls back to flat metrics
- RAE baselines partition by task (different envs do not contaminate)

* test(multi_agent_env): rollout, atomic commit, invariant, lineage cache

13 tests covering:
- init validation (empty/duplicate members, stray overrides)
- sequential rollout with correct member tagging + stop conditions
- priority ordering (error > schedule_exhausted > prompt_too_long)
- simultaneous slot atomic commit (all-or-none on mid-slot error)
- monotonic build_prompt invariant across a 4-slot rollout
- actor_overrides routing to per-member (client, model)
- lineage-scoped prefix match: A's second turn hits A's cache, not B's

* test(kernel): regression tests for native-think leak + quarantine

- test_parse_channels_strips_native_think_with_custom_tag: with pack
  configured think_tag='reason', native <think>secret</think> never
  reaches public_channel and is NOT promoted to private_channel.
- test_apply_action_quarantines_malformed_think_markup: malformed
  model output commits with parse_error flag instead of aborting;
  kernel-state violations (wrong actor) still raise.
- test_rollout_survives_benign_prose_with_bracket_words: 'I will
  <think> and answer' parses as quarantined, schedule still advances,
  peer member still gets to speak.

* test(kernel): assert exact whitespace contract in native-think strip test

Replace weak or-chain (pub == 'public  tail'.strip() or ...) with exact
assertion pub == 'public  tail'. Documents parse_channels' whitespace
contract: block excision preserves internal whitespace, outer strip()
only trims leading/trailing.

* fix: work around transformers lazy_load_kernel offline regression (#2276)

* fix(scheduler,bridge): narrow error catch + atomic reward schema

Scheduler:
  The blanket 'except Exception' in _process_finished_task swallowed
  every non-CancelledError — MemoryError, AttributeError, KeyError from
  dataset corruption, KernelProtocolError, OverlongPromptError — and
  converted them to silent sample loss. Hiding these during a migration
  is exactly the opposite of what we want. Narrowed to the two error
  classes verifiers.utils.async_utils.maybe_retry considers retryable:
  vf.InfraError (incl. TunnelError, SandboxError, BrowserSandboxError)
  and vf.InvalidModelResponseError (incl. EmptyModelResponseError).
  Everything else propagates loud.

Bridge:
  _resolve_member_reward worked per-member, which let a half-migrated
  rubric write structured for some members and flat for others on the
  same rollout, silently merging two schemas. Replaced with
  _resolve_reward_schema(members, ...) — atomic decision per rollout.
  If state['member_rewards'] is present it MUST cover every member;
  otherwise ValueError. Otherwise all members come from the legacy
  flat 'reward/{mid}' keys.

Tests:
  - test_bridge_partial_structured_rewards_raises (was
    test_bridge_structured_missing_member_falls_back) — inverts the
    semantic: partial coverage now raises instead of mixing.
  - test_bridge_flat_missing_member_is_none — legacy flat path still
    tolerates missing keys (preserves pre-migration semantic).

303/303 tests pass.
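
A minimal sketch of the narrowed catch (import path and the drop-and-refill helper are assumptions):

    import asyncio
    import verifiers as vf

    async def process_finished_task(task):
        try:
            return await task
        except asyncio.CancelledError:
            raise                            # never swallow cancellation
        except (vf.InfraError, vf.InvalidModelResponseError) as exc:
            drop_and_refill(exc)             # hypothetical retryable-path helper
        # anything else (MemoryError, KeyError, ...) propagates loud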

* test(multi_agent_env): TaskGroup cancellation + post-commit rollback

Two new tests covering the HIGH findings from round 2:

- test_simultaneous_slot_cancels_peer_on_first_failure: asserts peer
  actor never reaches its completion line when a sibling raises first
  (TaskGroup cancellation contract).
- test_simultaneous_slot_rolls_back_on_post_commit_hook_failure:
  asserts state["_kernel"] stays at the pre-slot snapshot and
  trajectory remains empty when on_step_committed raises mid-slot.

* test(debate_env): monotonic invariant + real-types e2e rollout

Adds two structural tests for Phase 5's DebateEnv refactor:

1. test_debate_env_build_prompt_monotonic_across_slots -- asserts that
   for each member, the slot_{N+1} prompt is a byte-equal extension of
   slot_N's prompt. The prefix-cache path in the token client depends
   on this, and breaking it silently turns an O(T) episode into O(T^2).

2. test_debate_env_end_to_end_real_types_rollout -- drives a full
   rollout + score on the production selfplay prompt pack with no mocks
   on core types (DebatePrompts, FieldSpec, DebateRubric). Only the
   client is faked. Verifies trajectory tagging, reward, and completion.

Also updates test_debate_complete_fires_when_schedule_exhausted to
expect the inherited 'schedule_exhausted' stop-condition name now that
DebateEnv inherits stop conditions from MultiAgentEnv.

* test(debate_env): migrate shim call sites, drop zombie consolidate tests

Mirror of the verifiers cleanup (c1ddf1d):
  * env.debate_complete(state)     -> env.schedule_exhausted(state)
  * env._resolve_actor(x)          -> env.resolve_actor(x)
  * delete _consolidate_messages import
  * delete test_consolidate_merges_contiguous_user_messages
  * delete test_consolidate_does_not_merge_system_messages

The two dropped tests asserted behavior that no longer runs in
production (build_prompt stopped calling the consolidator in the
monotonic refactor). 94 tests pass, was 96.

* vf bump (#2302)

* fix: clean stale rollouts and broadcasts on fresh runs (#2304)

Previously `clean_future_steps` only ran when resuming from a checkpoint,
so a fresh run started in an output_dir containing stale rollouts or
broadcasts from a previous run would consume them: the trainer would
train on stale data and the orchestrator would compute a negative async
level because it sees a trainer that is seemingly ahead of it.

Run the same cleanup from step 0 when training from scratch so these
artifacts are removed before training begins.

* test(maenv): regression tests for fold / positional round_index / strict pack validation

10 tests covering:
- fold_consecutive_user_messages: idempotence, SA tool no-op, tool-metadata
  preservation, multimodal content-list safety, merged-user metadata carry.
- DebateEnv.build_prompt end-to-end: folded rollout prompts produce a single
  trailing user msg that _is_valid_env_tail accepts; prefix byte-equality
  between slot-N cache and slot-N+1 prompt.
- DebateEnv positional round_index: sparse slot_ids (10, 20, 30, 40) render
  the same past-instruction text as contiguous (0, 1, 2, 3).
- DebatePrompts._validate: rejects round_index in system, phase in question,
  accepts turn-invariant templates even when user block references per-turn
  vars.

* test(maenv): drop hardcoded sys.path; tighten multimodal fold assertion

Auditor flagged:
- sys.path.insert with hardcoded /Users/joanvelja/... path — works only
  on the laptop, breaks CI. Dropped; the sibling-fork venv already has
  verifiers importable.
- unused `import yaml`. Dropped.
- test_fold_skips_multimodal_content_lists asserted only len(folded)==2,
  weak. Now asserts folded == msgs byte-for-byte and confirms the
  image_url structural part is preserved.

* test(maenv): 6 regression tests for AST validator + per-member num_rounds

AST validator bypass coverage (all were silent under the regex):
- {% if is_first_round %} statement-tag bypass
- {{ hints[round_index] }} index-access bypass
- {% set r = round_index %} set-directive bypass
- is_first_round variable (was missing from original list)

Per-member num_rounds:
- simultaneous schedule [AB, AB]: num_rounds == 2 per member (not 1)
- asymmetric schedule: A=3 / B=2 (not 5//2=2 for both)

* chore: bump vllm-router to v0.1.22 (#2292)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(multi_actor_advantage): use defaultdict for per-key aggregation

Replace manual .get-or-default pattern in key_sums/key_counts with
defaultdict. Iterate via .items() in the update loop instead of
re-indexing by key.
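
A minimal sketch of the pattern (variable names from the commit; the input mapping is assumed):

    from collections import defaultdict

    key_sums = defaultdict(float)
    key_counts = defaultdict(int)

    for key, reward in rewards_by_key.items():  # rewards_by_key is assumed
        key_sums[key] += reward
        key_counts[key] += 1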

* refactor(bridge): drop flat-metrics fallback, require member_rewards

Pairs with verifiers commit removing member_metrics/episode_metrics.
Now that every rubric must write state['member_rewards'], the bridge's
legacy fallback to metrics['reward/{mid}'] (and its one-time deprecation
warn, module global, helper layer) is dead.

_resolve_reward_schema → _resolve_member_rewards: one-shot lookup,
raises on absence or partial coverage. No schema decision, no fallback.

Test migration:
- test_multi_agent_rubric: drop member_metrics/episode_metrics
  assertions; contract is now just member_rewards.
- test_multi_actor_bridge: _make_rollout_output uses member_rewards
  parameter (was metrics with reward/{mid}). Dropped three legacy-
  fallback tests (falls-back-to-flat, flat-missing-is-None, prefers-
  over-flat) → replaced with one partial-coverage-raises contract
  test and one missing-member-rewards-raises test.
- test_debate_env full-pipeline test patches member_rewards['J'] for
  the post-rollout injected judge step.

* feat(buffer,bridge): accept int | str example_id end-to-end

Buffer's isinstance check + example_buffer type signature widen to
int | str. The dict keys int | str without any code change — Python
hashes both cleanly.

Bridge MemberRollout.example_id + _validated_example_id widen to
int | str (previously int-only with a gate rejecting str). The
gate-and-revert dance from earlier in this PR goes away now that the
three layers (dataset, buffer, bridge) are consistent.

Test migration: test_str_example_id_rejected_until_dataset_and_buffer_support_it → test_str_example_id_flows_through_bridge. The rejection
semantic is now a positive test for propagation.

Note: prime-rl venv is linux-only per lockfile, so the buffer-side
torch-dependent integration tests can't run on Darwin. The type widen
is structurally verified: isinstance check accepts both; dict keys on
both; bridge round-trip test on a str id passes end-to-end through the
non-torch layer.

* fix: check rollout error before empty trajectory in scheduler (#2308)

When `verifiers` CliAgentEnv catches an agent crash pre-LLM-call, it
sets `state["error"]` but the trajectory stays `[]` because the agent
never produced any messages. The previous branch order fired the
"Empty trajectory" warning first and dropped the detailed AgentError
diagnostic. Swap the branches so error-bearing rollouts surface
"Rollout error ...: {error_chain_repr}" instead.

Related: PrimeIntellect-ai/verifiers#1127, #1130

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
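
A minimal sketch of the swapped branch order (logger and state shapes assumed):

    if state.get("error") is not None:
        # Surface the detailed diagnostic first, even when the agent
        # crashed before producing any messages.
        logger.warning(f"Rollout error ...: {state['error']!r}")
    elif not state["trajectory"]:
        logger.warning("Empty trajectory")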

* fix(eval): restore per-rollout isolation + correct total_turns fallback

Colleague review flagged three regressions in the initial MA commit:

P1 pyproject verifiers source (already reverted to git pin).

P2 eval failure semantics (vf_utils.py): the earlier change dropped
_get_eval_inputs flattening and passed rollouts_per_example=K into
generate(). That routes through env.run_group(), which uses
asyncio.gather() WITHOUT return_exceptions=True and retries the whole
K-group on any raise. One transient failure past max_retries dropped
every rollout for that example, biasing pass@k / avg@k toward examples
that never flake.

Verified inertness before reverting: DebateRubric / MultiAgentRubric
declare no GroupRewardFunc; multi_actor_eval groups on
base_example_id post-hoc. The change enabled no active consumer, so
reverting loses nothing currently used.

Revert: keep _get_eval_inputs flattening upfront, pass
rollouts_per_example=1 so each rollout is its own run_group call.
Comment documents the trade-off for future comparative rubrics.

P3 total_turns fallback (multi_actor_eval.py): len(r.members[0].trajectory)
counted one participant's steps. An alternating A/B schedule
under-reported by factor 2; A/B/J by ≈3. Fixed to
sum(len(m.trajectory) for m in r.members).

* fix: serialize env server spawn to avoid port race (#2310)

get_free_port() only holds the port until it returns, so parallel
env spawns under asyncio.gather could hand the same port to two
children — the loser died with EADDRINUSE. Serializing start()
and awaiting wait_for_server_startup() between envs ensures each
port is bound before the next one is picked.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
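
A minimal sketch of the serialized spawn (helper names from the commit; the loop shape is an assumption):

    for env in envs:
        port = get_free_port()              # free only until this returns
        env.start(port=port)
        await wait_for_server_startup(env)  # port is bound before the next pick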

* Add FA4 (flash_attn.cute) support to ring attention, enabling context (#2307)

parallel training with FA4 kernels. Mirrors the FA3 ring attention
pattern (all-gather K/V, compute per GQA stride, reduce-scatter grads)
using FA4 low-level _flash_attn_fwd/_flash_attn_bwd.

Changes:
- ring_attn.py: FA4 forward/backward wrappers, _RingFA4Varlen autograd
  Function, ring_fa4_varlen_func public API
- attn.py: route FA4 to ring_fa4_varlen_func in substitute_ring_attn
- trainer.py: allow CP with fa4 (requires model.impl='custom')

* Fix Prime monitor public API flow (#2205)

* Use bearer auth for Prime monitor uploads

* Fix Prime monitor presign and finalize flow

* Sanitize non-finite Prime monitor payloads

* Simplify Prime monitor payload normalization

* Simplify Prime monitor public API contract

* Simplify public presign response parsing

* Simplify non-finite payload sanitization

* Inline public presign response parsing

* Inline non-finite payload sanitization logic

* Refine Prime monitor JSON sanitization

* Address review: inline auth headers and simplify sanitize

- Remove _api_headers() helper; store self._headers once in __init__
- Always sanitize payloads; drop silent try/except and log only when values are dropped
- Remove prime_cli sys.modules mocking from tests (real dep is installed)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: sami jaghouar <sami@primeintellect.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: remove prefix-cache-salt and reset-prefix-cache config flags (#2314)

* chore: remove prefix-cache-salt and reset-prefix-cache config flags

Hardcode the defaults: always set cache_salt on inference requests
(keyed by ckpt_step) and never reset the prefix cache after weight or
LoRA updates. The salt alone is sufficient to invalidate stale KV
states across policy updates, so the reset path is redundant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
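
A minimal sketch of the hardcoded salting (request shape assumed):

    # Salt every inference request by the checkpoint step; KV entries cached
    # under an older policy can then never prefix-match new requests.
    extra_body = {"cache_salt": f"ckpt-{ckpt_step}"}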

* chore: keep empty experimental sub-configs as extension points

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump verifiers pin to a036fce (includes v0.1.12 sync)

Upstream verifiers main was merged into our feat/debate-env branch
(github 'Sync fork' → merge main). Commit a036fce on
joanvelja/verifiers brings in v0.1.12:
- TITO tool-shape dummy assistant fix (stitcher defensive)
- json_logging propagation to env workers
- swebench root-logger hijack fix
- tomllib/tomli py3.10 guard
- CliAgentEnv dead-tunnel fix + AgentError double-wrap fix
- NeMoRLChatCompletionsClient available as actor_overrides target
- composable Task/Agent/Environment experimental (orthogonal to MA)

332/332 multi-actor tests green against new pin. No MA-path changes
required — upstream surfaces (RLM, CliAgent, composable, CLI eval)
are orthogonal to our MultiAgentEnv stack.

* refactor(orchestrator): MARScore bridge + P0 fixes + dead Path-B removal

Pairs with verifiers e04c8f5 (MARScore + MemberScore + factory rewiring).
The bridge now reads the typed ``state["mar_score"]`` payload directly —
dropped 5-key dict plumbing, schema drift is structurally impossible.

Bridge
  - multi_actor_bridge: rewrite rollout_to_member_rollouts to read
    output["mar_score"] (verifiers.types.MARScore). Drops
    _resolve_member_rewards, _validated_example_id, _member_to_rollout,
    and the dead episodes_to_member_rollouts (Path-B push protocol).
    Auto-coerces dict -> MARScore via model_validate, so the wire format
    (in-memory object vs. JSON-round-tripped dict) is transparent.

P0-2 quarantine masking
  - trajectories.interleave_rollout: check
    step["extras"]["parse_error"] and mask completion tokens (both
    make_sample + extend_sample paths). Previously only the global
    output["error"] gated masking, leaking malformed model tokens into
    training despite the kernel's per-utterance quarantine.

Scheduler widen
  - scheduler: TimeoutError added to the retryable-transient catch
    alongside (vf.InfraError, vf.InvalidModelResponseError). The env
    server client raises built-in TimeoutError on recovery timeouts;
    those stalls should follow the same drop-and-refill path.
  - test_scheduler: regression test asserting a mid-group TimeoutError
    is dropped, the group state is cleaned, and the remaining rollouts
    proceed.

Path-B graveyard (zero production callers, confirmed by grep across
both repos)
  - Delete multi_actor.py (197 LOC) — run_episode / run_episode_group
    consumer of the MultiActorEnv Protocol. No implementation of the
    Protocol exists in either tree.
  - Delete multi_actor_eval.py (135 LOC) — evaluate_multi_actor_episodes
    consumes EpisodeResult (Path-B). Duplicates eval_utils._pass_at_k.
  - Retain multi_actor_advantage.py (RAE baselines). Path-B-tagged but
    reusable: MemberRollout-compatible, per-(task, example_id, role_id)
    partitioning — the obvious advantage path for MA training wiring.
    Annotation widened to tuple[str, int | str, str] to match the
    MemberRollout.example_id int|str contract end-to-end.

Tests
  - test_multi_actor_bridge: fixtures rebuilt to construct RolloutOutput
    via the real state_to_output -> JSON round trip. Closes the test-
    fabrication hole that hid the original P0 (state["member_rewards"]
    silently dropped at serialization).
  - test_multi_agent_rubric: updated for MARScore contract; adds
    coverage that base rubric does NOT overwrite subclass's partial
    mar_score on vf.Error.
  - test_marscore_stress: 33 adversarial property tests across 10
    sections (schema invariants, round-trip fidelity, SA fallback,
    dict/object bridge input, P0-1 ExceptionGroup flattening, P0-2
    quarantine propagation, P0-4 fork/merge isolation, errored-rollout
    round-trip, schema enforcement, projection invariants).
  - test_multi_actor_advantage: dedicated suite for RAE (cold start,
    EMA, per-role/example/task baseline independence, ordering
    invariance, repeated-key mean update, str example_id).
  - test_debate_env / test_debate_prompts: migrated assertions to the
    new contract via inline _views helper (legacy-shape projection of
    mar_score for backwards test readability) and the
    DebatePrompts.__post_init__ verdict-token collision check now fires
    at pack construction (was in load_environment).

pyproject: bump verifiers pin a036fce -> e04c8f5.

* refactor(orchestrator): consume verifiers multi-agent bridge

* refactor: unify actor→agent naming across orchestrator multi-agent modules

Paired with verifiers 638504d (same rename + build_prompt decomposition +
arch doc). Zero behavior change on the prime-rl side — mechanical
consumer-side rename.

- Rename multi_actor_advantage.py -> multi_agent_advantage.py (git mv)
- Rename multi_actor_bridge.py -> multi_agent_bridge.py (git mv; still a
  thin compat shim that re-exports verifiers' rollout_to_member_rollouts
  and MemberRollout)
- Rename test_multi_actor_* -> test_multi_agent_* (git mv)
- Update imports: verifiers.envs.multi_actor_kernel -> multi_agent_kernel
- Update field access: slot.actors -> slot.agents
- Update identifier names: actor_overrides -> agent_overrides etc.
- "member"/member_id/member_rewards unchanged — distinct roster-level concept
- Bump verifiers pin: e04c8f5 -> 638504d

331 multi-agent tests pass unchanged.

* feat: drop filtered rollouts instead of masking (#2277)

* feat: drop filtered rollouts from training batch instead of masking

Previously, enforced filters zeroed the completion_mask on detected
rollouts but still sent them through the entire training pipeline.
This wastes compute on samples that contribute nothing to the loss.

Now, `apply_filters` returns the subset of rollouts that should be
sent to the trainer. Enforced-detected rollouts are excluded before
pretokenization, VLM cache building, and sample construction.

The trainer handles the resulting empty batches ("phantom steps") by
skipping forward/backward and logging `data/is_empty_batch`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: retry empty filtered batches instead of passing them to trainer

Keep the invariant that the trainer only receives non-empty batches. If
all rollouts are filtered out, regenerate the batch (up to 3 retries)
and crash the orchestrator on sustained failure. Warn at <=10% trainable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
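
A minimal sketch of the retry loop (the attempt cap and is_filtered flag take their final names from later commits in this stack; generate_batch is assumed):

    MAX_EMPTY_BATCH_ATTEMPTS = 3

    for attempt in range(1, MAX_EMPTY_BATCH_ATTEMPTS + 1):
        rollouts = await generate_batch()
        trainable = [r for r in rollouts if not r["is_filtered"]]
        if trainable:
            break
        logger.warning(f"Attempt {attempt}/{MAX_EMPTY_BATCH_ATTEMPTS} produced no trainable rollouts - retrying")
    else:
        raise RuntimeError("No trainable rollouts after repeated regeneration")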

* chore: drop redundant num_rollouts guard

The retry loop only breaks when len(filtered_rollouts) > 0, which implies
num_rollouts > 0, so the guard is unreachable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: expand low-trainable-ratio warning with env review hint

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: drop empty-df guard and inline filtered metrics

filtered_rollouts is guaranteed non-empty after the retry loop, so the
empty-df branch is unreachable and the intermediate locals add no value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: hoist MAX_EMPTY_BATCH_RETRIES to module scope

Also rename the loop var and warning message to "retry N/MAX" so the
counter excludes the initial attempt and reads less ambiguously.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: adjust log style in filter retry warnings

Drop trailing periods and replace ";" with " - " as clause separator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: clarify low-trainable-ratio warning hint

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: compute metrics over all rollouts, drop only from trainer

Metric logging reverts to main's semantics: all rollouts contribute to
prefill_len, decode_len, samples_per_rollout, and results_df. Filtered
rollouts are still pretokenized and interleaved, but their samples are
simply not added to train_examples. Also inline the generate_batch
coroutine since it is awaited immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: move filter flags to rollout["filter"] + is_filtered

Per-filter detection booleans now live under rollout["filter"], and a
top-level rollout["is_filtered"] captures whether any enforcing filter
triggered. The orchestrator uses is_filtered directly as the keep gate
(no more id() mapping). apply_filters no longer returns filtered_rollouts
- the in-place flags are the single source of truth. Also unbound-var
fix for retry-loop locals, and per-env filter/<env>/<flag>_rate logging
that mirrors the metrics logging pattern.

Both new fields are serialized to train_rollouts.jsonl via save_rollouts,
which already writes all top-level rollout keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: cap to 3 total batch-generation attempts, not 3 retries

Rename MAX_EMPTY_BATCH_RETRIES to MAX_EMPTY_BATCH_ATTEMPTS and have the
loop run exactly that many times. Warning now reports the attempt that
just failed ("Attempt N/3 ... retrying").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: log error line before raising on exhausted retries

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: align filter metric key names with per-env logging

Rename filter/total_detected_rate -> filter/detected_rate and
filter/total_enforced_rate -> filter/is_filtered_rate so the overall
keys mirror the per-env filter/<env>/is_filtered_rate naming.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: unify filter logging under filter/{all,<env>}/{<filter>,is_filtered}

Move is_filtered into results_df so it can be aggregated per-env like
is_truncated. filter_df now holds just per-filter detection booleans.
apply_filters no longer returns an aggregate metrics dict - the
orchestrator derives the rates uniformly across the "all" and per-env
scopes, with symmetric key naming and no _rate/_count suffixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: rename rollout["filter"] to rollout["filters"] + log keys

Aligns with the plural configs list and the rollout-level "filters"
namespace. Log keys change from filter/{all,<env>}/... to
filters/{all,<env>}/....

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: self-evict orchestrator when batches carry no learning signal

Write control/evicted.txt before raising, so the multi-run manager
skips the run on rediscovery instead of treating it as a hard crash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* update dependency (#2317)

Co-authored-by: Mika Senghaas <mail@mikasenghaas.de>

* test(maenv): update fold contract tests to typed Messages

verifiers' fold_consecutive_user_messages narrowed from
(Messages | list[dict]) → list[dict]
to:
Messages → Messages
— typed in, typed out, with model_copy preserving extras (e.g.
OpenAI `name` field under CustomBaseModel extra="allow"). Tests
updated to construct typed UserMessage / SystemMessage /
AssistantMessage / ToolMessage inputs and assert via attribute
access (m.content, m.role) instead of dict indexing.

End-to-end roundtrip test simplified: _is_valid_env_tail's _get_role
helper accepts both attr and key access, so we pass typed messages
straight through without model_dump.

* chore: rename deprecated orchestrator config keys (#2327)

Rename '[orchestrator.sampling]' -> '[orchestrator.train.sampling]',
'[[orchestrator.env]]' -> '[[orchestrator.train.env]]', and
'max_tokens' -> 'max_completion_tokens' across all configs to remove
reliance on the deprecated auto-translation.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(multi_agent): align bridge/advantage to verifiers α-cut API

Bumps verifiers pin from 638504d → b723fda. Verifiers' α-cut deleted
role_id as a redundant duplicate of member_id (the dual labeling poisoned
RAE baseline buckets when MemberScore.role_id and MemberRollout.role_id
diverged on errored rollouts). prime-rl now follows the cut end-to-end.

API alignment:
* MemberScore / MemberRollout / TrajectoryStep extras drop role_id
* DebateEnv constructor drops role_for_agent kwarg (pack prompts key by
  member_id directly)
* DebateRubric kwarg truth_role → truth_member
* rollout_to_member_rollouts(output) — env_name positional dropped;
  bridge no longer overwrites output["task"]
* MARScore.to_wandb_flat() → to_metrics_flat()
* Errored MARScore episode_metrics is now {"errored_rollout": 1.0} only;
  error_type / error_phase moved to MARScore.episode_error
* MultiAgentEnv._flatten_exception_group removed (asyncio.TaskGroup
  replaced by asyncio.wait — no flattening needed)
* DebateRubric._count_parse_errors removed; counting now lives in
  member_snapshot which returns parse_errors as part of a per-member dict
* DebatePrompts.wrap_opponent / build_context kwargs viewer_role/role_id
  → viewer_id/member_id
* DebateRubric.judge_client lazy: construction succeeds without it;
  verdict() raises at score time. _grade/_match collapsed into verdict()
  raising vf.Error (not RuntimeError)

src changes:
* multi_agent_advantage.RAEKey docstring + key construction:
  (task, example_id, role_id) → (task, example_id, member_id)

Test changes (updates, no deletions of behavior coverage):
* Member naming restructured: env-rollout tests use members=
  ["prover","verifier"]; rubric/score-time tests use ["debater_a",
  "debater_b","judge"] so member_ids match prompt-pack keys directly
* Stale-behavior tests repurposed to assert the new fail-loud /
  captured-error / no-overwrite contracts (e.g.
  test_round_trip_preserves_role_id → test_round_trip_preserves_member_id_assignment)
* test_bridge_raises_on_missing_sampling_args → repurposed to assert
  that omitted temperature defaults to 1.0 (sampling_args is now always
  projected as {} by state_to_output)
* Loser zero_sum_reward asserted as -1.0 (was 0.0 — current
  zero_sum_reward is winner+1 / loser-1 / judge 0 / tie 0)
* Tests covering removed eager judge_client validation gates flipped to
  assert score-time verdict() failure instead

330 / 330 collectable orchestrator unit tests pass.

* fix(multi_agent_advantage): SPIRAL Alg.1 ordering — update EMA before subtract

Previous code did subtract-then-update with per-batch mean aggregation:
  for τ in B:  A(τ) = R(τ) - b
  b ← α·b + (1-α)·mean({R(τ)})

SPIRAL Alg.1 (arxiv:2506.24119, lines 18-22, verbatim):
  for (τ, G_i) ∈ B do
    for p ∈ {0, 1} do
      b_{G_i,p} ← α·b_{G_i,p} + (1 - α)·R_p(τ)         [line 20]
      A_{G_i,p}(τ) ← R_p(τ) - b_{G_i,p}                  [line 21]

Per-trajectory, update-then-subtract. Each rollout's advantage is
computed against the baseline that has just absorbed its own reward;
sequential rollouts sharing a key compound through the EMA recursion
rather than collapsing to a single mean update.
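
A minimal sketch of the corrected ordering (cold-start baseline of 0.0 assumed; this reproduces the numbers below):

    def rae_advantage(baselines: dict, key, reward: float,
                      momentum: float = 0.9) -> float:
        b = baselines.get(key, 0.0)
        b = momentum * b + (1.0 - momentum) * reward  # Alg.1 line 20: update first
        baselines[key] = b
        return reward - b                             # Alg.1 line 21: then subtract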

Numerical impact (cold-start, momentum=0.9):
                            OLD       NEW
  single R=1.0          A=1.0     A=0.9        (=α·R)
  rep-key [1.0, 0.0]    A=[1, 0]  A=[0.5, -0.25]   (mom=0.5)
  end baseline          0.25      0.25         (same in this case)

For sequential batches the two schemes converge asymptotically: at
α=0.9, after 20 rounds of R=1, both end at baseline≈0.878, with OLD's
20th advantage ≈0.135 (it subtracts the pre-update baseline) vs NEW's
≈0.122. The within-batch ordering invariant the previous
implementation relied on no longer holds: see
test_within_batch_ordering_compounds_per_trajectory.

Tests updated (5):
* test_cold_start_advantage_equals_reward → ..._is_reward_minus_post_update_baseline
  (asserts α·R = 0.9 instead of R = 1.0)
* test_second_batch_uses_updated_baseline (asserts [0.9, 0.81]
  instead of [1.0, 0.9])
* test_within_batch_ordering_invariant → ..._compounds_per_trajectory
  (asserts that order DOES matter — distinct end baselines)
* test_repeated_key_in_batch_uses_mean_for_baseline_update →
  ..._compounds_per_trajectory (asserts per-trajectory recursion, no
  mean aggregation)
* test_zero_reward_from_errored_rollout_keys_correctly (A=-0.35
  instead of -0.7 — baseline is updated before the subtract)

Other 8 tests unchanged: cold-start single-key, distinct keys (per-
member, per-example, per-task), str example_id, none reward, empty
batch, degenerate group, baselines_update_after_batch.

13 / 13 advantage tests pass; 330 / 330 collectable orchestrator tests
pass.

* feat(ckpt): persist RAEState alongside progress + buffer

CheckpointManager.save / load now accept an optional rae_state: RAEState
| None. When set, the EMA baselines + momentum are serialized to
rae_state.pt next to progress.pt; when omitted, no file is written. On
load with rae_state set but file missing, we FileNotFoundError loudly
rather than silently cold-starting — discarding EMA history mid-run is
the kind of "training looks fine but has invisibly worse variance" bug
the no-silent-fallbacks rule exists to prevent.

Single-agent runs are unaffected: callers that pass rae_state=None (the
default) get the original save/load behavior with no rae_state.pt
written or expected.

Test: round-trip + missing-file + omit-on-save (3 cases). Skipped on
Darwin where torch isn't importable from the verifiers venv we run from
— runs cleanly on Linux with prime-rl's full deps.

* feat(orchestrator): route multi-agent rollouts through RAE per-member path

Detects MultiAgentRubric on the env group at startup and branches the
per-step training pipeline:

  episode rollout (1 per inference call)
      ├─[single-agent]→ compute_advantages (GRPO) → 1 training unit
      └─[multi-agent]──→ rollout_to_member_rollouts (verifiers bridge)
                          ↓
                         drop judge member (config.rae.drop_judge=True default)
                          ↓
                         compute_rae_advantages (SPIRAL Alg.1)
                          ↓
                         N training units (one per member)

Both paths feed into the same downstream pretokenize → interleave_rollout
→ TrainingSample assignment. Per-rollout metrics (results_df) preserve
single-agent shape — per-unit token counts fold back via a
``rollout_to_unit_idxs`` mapping.

Guardrails:
* mixed MA + single-agent envs in one EnvGroup → NotImplementedError
  (different per-step branching, defer hybrid until a real use case shows)
* MA + VLM → NotImplementedError (image cache key fan-out unimplemented)
* RAE state lifecycle: instantiate at startup, persist via ckpt.save,
  restore via ckpt.load on resume (rae_state.pt round-trip)
* Judge filter is opt-out (config.rae.drop_judge=True default) — judge
  has reward=0 by zero_sum_reward construction, training those tokens
  burns gradient compute on policy-neutral noise

New config: ``rae: RAEConfig`` with ``momentum`` (Alg.1 α decay, default
0.9) and ``drop_judge`` (default True). Single-agent runs ignore it.

New helper: ``fan_out_for_multi_agent(rollouts, drop_judge) -> (units,
rollout_to_unit_idxs)`` extracted from the orchestrator inline so the
fan-out logic is independently testable. 5 fan-out unit tests cover
judge-drop, judge-keep, multi-rollout index mapping, end-to-end pipe
into compute_rae_advantages, and empty-batch.

Stage 3 follow-ups (separate PRs, not blockers for this wiring):
* verifiers-side ``agent_overrides_resolver`` for per-episode learner
  seat assignment (gates first training run)
* prime-rl filter to keep only ``member_id == row["learner_seat"]``
  units (depends on the verifiers PR landing)

335 / 335 collectable orchestrator tests pass. Wiring change: 174 LOC
(orchestrator.py: 122, advantage helper: 33, config: 34, minus 15
removed lines) — well under the briefing's 300-LOC bail-out.

* fix(orchestrator): bind use_rae before VLM gate; persist RAEState in final ckpt

Two bugs caught in Codex review of the multi-agent wiring:

P1 (BLOCKER, every launch): the ``if use_rae and is_vlm`` guard at
line ~146 read ``use_rae`` before the MA detection block at line ~220
assigned it. Python's local-scope rule promotes ``use_rae`` to local
throughout the function as soon as ANY assignment exists, so the
earlier read raised ``UnboundLocalError`` on EVERY orchestrate()
invocation — single-agent and multi-agent alike. Moved the VLM+MA
gate inside the ``if use_rae:`` block where ``use_rae`` is bound.
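
The scoping rule behind P1 reproduces in a few lines (self-contained illustration, not the orchestrator code):

    x = "global"

    def f():
        print(x)     # raises UnboundLocalError: the assignment below makes
        x = "local"  # x local for the WHOLE function body, including the read above

    f()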

P2 (data loss on resume): the final ``ckpt_manager.save`` after the
loop didn't pass ``rae_state=``. Multi-agent runs that finished on a
non-interval step wrote a checkpoint without ``rae_state.pt``;
resume from that checkpoint then hit the load-side
FileNotFoundError that ckpt.py raises by design (no silent
cold-start). Added the kwarg.

Static AST invariants test added — three properties caught both bugs
without needing the heavy orchestrate harness:

* use_rae: first Load (by source line) ≥ first Store
* rae_state: same invariant
* every ``ckpt_manager.save / load`` call passes ``rae_state=``

These trigger on the bytecode shape, not behavior, so they catch the
class of bug at parse time. ``ast.walk`` is BFS, not document-order,
so the test takes ``min`` of all line numbers per ctx rather than
``first encountered`` — initially passed P1 spuriously because the
deeper Load node was visited later than the shallower Store node.

339 collectable orchestrator tests pass + 1 skipped (torch-gated).
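
A minimal sketch of the first-Load-vs-first-Store invariant (test-harness shape assumed; note the min() over line numbers per ctx, per the caveat above):

    import ast

    def first_load_and_store_lines(src: str, name: str) -> tuple[int, int]:
        names = [n for n in ast.walk(ast.parse(src))
                 if isinstance(n, ast.Name) and n.id == name]
        first_load = min(n.lineno for n in names if isinstance(n.ctx, ast.Load))
        first_store = min(n.lineno for n in names if isinstance(n.ctx, ast.Store))
        return first_load, first_store

    def assert_bound_before_read(src: str, name: str) -> None:
        first_load, first_store = first_load_and_store_lines(src, name)
        assert first_load >= first_store, f"{name} is read before it is assigned"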

* refactor(advantage): unify [rae] into [advantage] union; split [multi_agent] for routing

Surfaces the orthogonality of pipeline stages that the previous shape
conflated. RAE is a baseline-subtraction layer (stage 3); MA fan-out is
routing (stage 2); loss is a separate function in the trainer (stage 5).
The previous ``[rae]`` block at the top level made it look like RAE was
a coupled "MA path" — it isn't. RAE composes with any loss; you can run
SPIRAL EMA + asymmetric IPO clip + length-shaped reward independently.

Config surface (was → is):

  [rae]                       [advantage]
    momentum                    type = "ema_per_member"  ← discriminator
    drop_judge                  momentum

                              [multi_agent]
                                drop_judge

The advantage discriminated union now has three variants:

  type = "default"          GRPO group-mean baseline (single-agent only)
  type = "ema_per_member"   SPIRAL Alg.1 EMA per (task, ex, member_id)
  type = "custom"           import_path + kwargs

Cross-validation at orchestrator startup (pydantic can't see the rubric):
* MA env + type="default" → ValueError (samples_per_problem grouping
  ambiguous after fan-out)
* SA env + type="ema_per_member" → ValueError (member_id key meaningless)
* MA env + type="custom" → permitted (user's responsibility)

Orchestrator changes:
* ``use_rae`` → ``is_ma`` (gates stage 2, not stage 3)
* ``rae_state`` → ``advantage_state`` (generic — placeholder for any
  stateful estimator we add later; currently only RAEState lives there)
* Per-step branching: stage 2 (fan-out) is independent of stage 3
  (advantage). The dispatch ``if advantage_type == "ema_per_member"``
  picks the per-unit estimator vs the flat-rewards GRPO/custom path.
* drop_judge moved from ``config.rae.drop_judge`` to
  ``config.multi_agent.drop_judge`` — it controls fan-out filtering, not
  baseline computation.

Static invariants test refactored to a parametrizable helper; added
checks for ``advantage_type`` and ``advantage_state`` to catch the same
class of UnboundLocalError that bit ``use_rae`` (P1 in commit 1e013eee0).

Net change: 340 / 340 tests pass + 1 skipped. No behavior change for
single-agent runs; multi-agent runs that previously used ``[rae]`` need
``[advantage] type = "ema_per_member"`` + ``[multi_agent]`` instead.
Greenfield repo, no compat shim.
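
For a multi-agent run, the migration might look like this in TOML (values illustrative):

    [advantage]
    type = "ema_per_member"  # discriminator; SPIRAL Alg.1 EMA baseline
    momentum = 0.9

    [multi_agent]
    drop_judge = true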

* feat(slurm): cleanup stale node-local state before launch (#2331)

* feat(slurm): cleanup stale node-local state before launch

Add a pre-workload srun step to the multi-node RL, multi-node SFT and
inference sbatch templates. It runs once per node and:

- kills orphan python/torchrun/vllm/prime_rl processes left over from a
  prior job that wedged after scancel (SLURM doesn't always reap cleanly
  when a job sits in CG for hours)
- removes stale vLLM and torch IPC state under /dev/shm/vllm-*,
  /tmp/vllm-*, /tmp/torch-*, /tmp/torchelastic_*

Without this, decode engines on previously-used nodes can hang at
"Waiting for READY message from DP Coordinator" because the new vLLM
process finds a stale /dev/shm segment or port holder from the dead run.
Symptom we hit: a fresh job timing out after 1800s because 4 decode
engines never became READY; a manual pdsh cleanup of the same nodes
fixed it immediately.

Each node prints one line (hostname, residual proc count, total GPU
memory in use) so the sbatch log shows the nodes came up clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(slurm): explicitly cover vllm-router in cleanup

Address review feedback: add vllm-router to the pkill list and the
procs-count regex so the intent is explicit, even though the broader
"vllm" patterns already match it as a substring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(slurm): also kill prctl-named vllm::router workers

pkill -f only matches the command line, so the vllm router's worker
processes — which set their kernel process name (comm) to "vllm::router"
via prctl but keep a different cmdline — slip through. Add process-name
pkill for "vllm" and "vllm::.*" to catch them.

Also broaden the post-cleanup procs count to look at both comm and args
(ps -eo comm,args) so we see these if any survive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: add conservative testing guidelines to AGENTS.md (#2330)

* configs: add gpqa_{rlvr,debate,consultancy} recipes

Three protocol comparisons on the same dataset (GPQA Diamond), same
model size (Qwen3-4B), same eval — what changes is where the reward
signal comes from:

  recipe                  reward source                  advantage
  ──────                  ─────────────                  ─────────
  gpqa_rlvr/rl.toml       verifier (exact letter match)  default GRPO
  gpqa_debate/            judge (winner-take-all)        ema_per_member
    rl_selfplay.toml                                     (SPIRAL Alg.1)
  gpqa_consultancy/       judge (picks assigned answer)  default GRPO
    rl.toml

The three are designed for direct A/B comparison: identical model,
batch size, sampling temperature, eval cadence. The diff is one
[advantage] block (or its absence) and the [[orchestrator.env]] id.

Status:
* gpqa_debate.rl_selfplay: works today against existing
  verifiers/environments/gpqa_debate package
* gpqa_rlvr + gpqa_consultancy: require new env packages in
  verifiers (sketches in environments/gpqa_rlvr and
  environments/gpqa_consultancy on a paired commit there)

Configs/ is informational per the README; not test-validated.

* test(debate_env): align packs with new schedule×prompts coverage check

verifiers commit 44f875e1 added an init-time cross-check on DebateEnv:
every (member_id, phase) in a StaticSchedule must have a matching
template in the prompts pack (system / question / user[member][phase]
or user[member]['default'] fallback). Several existing tests built
intentionally-incomplete packs and relied on the silent-no-instruction
failure mode the check now rejects.

Updates:
* DEBATE_PROMPTS top-level fixture: add opaque-label aliases (A, B, X,
  Y) for kernel-level cross-check tests that exercise members=
  validation against prover/verifier-keyed packs, and a 'default' user
  phase for prover/verifier so phase-specific schedule overrides
  (simultaneous etc.) don't trigger the new check.
* _make_think_prompts: add 'default' user phase fallback per member —
  these tests are about think-visibility / format_history, not
  instruction rendering.
* _open_ended_prompts / _judgeless_prompts: add judge keys (system +
  question + user.final). The "judgeless" name refers to the absence
  of a judges= dict, not the absence of a judge participant — the
  canonical _SCHEDULE_SLOTS *does* schedule a judge agent.
* _make_field_prompts: add verifier user templates + 'default' phase
  fallbacks so field-extraction tests work with any schedule.
* test_format_history_attributes_both_debaters_distinctly: add
  per-member default user templates (test is about wrap-template
  attribution, not user-instruction rendering).
* test_num_rounds_is_per_member_under_asymmetric_schedule: replace
  phase 'closing' (not in selfplay.yaml pack) with 'critique' — this
  test asserts on slot counts per member, not phase semantics.

340 / 340 + 1 skipped collectable orchestrator tests pass against the
new verifiers HEAD.

* chore: bump verifiers pin b723fda → 42a965e

Captures the two fork PRs that just landed on joanvelja/verifiers main:

  f4de712e feat(envs): add gpqa_rlvr (single-agent RLVR) + gpqa_consultancy
  78533ea7 fix(debate): validate effective prompt instruction coverage
          (the schedule×prompts init-time check)
  42a965e3 Merge GPQA baseline environments (HEAD)

Both were authored in this PR's branch stack (companion verifiers-side
commits). This final bump on the prime-rl branch makes the MA wiring,
new configs, and new env packages depend on a reproducible upstream
SHA rather than a moving HEAD.

Re-validated: 340 orchestrator unit tests pass + 1 skipped (torch-gated
ckpt round-trip) against the new verifiers HEAD via the verifiers venv
with prime-rl installed editable + --noconftest. No behavior change.

* chore(tmp): zebra pass@N headroom probe for Isambard

vLLM pass@{1,8} probe on Qwen3-4B-Instruct over 3x3/4x4 zebra buckets,
with Slurm wrapper and format-sanity sample. Parquet stays local.

* chore: signpost LoRA-self vs base pre-flight smoke for first GPU run

Three-layer signpost so the smoke is unmissable when the next session
loads on a GPU for the first learner-vs-fixed debate training run in
the LoRA-self topology (single vLLM hosting learner adapter + base).

  1. skills/preflight-lora-smoke/SKILL.md
     Auto-surfaces to agents working on "LoRA", "external opponent",
     "first GPU run", "enable_lora", "load_lora_adapter" contexts.
     Documents the three failure modes the web search turned up on
     vLLM 0.19 and how to interpret probe failures.

  2. scripts/preflight_lora_smoke.py
     Executable, ~200 LOC, three probes with PASS/FAIL output:
       - mixed-batch correctness (base and adapter coexist in one batch)
       - hot-swap idempotence (the #18372 probe: 3rd+ swap dropping)
       - per-request perf delta on LoRA-enabled server (#10898 tax)
     Non-zero exit on any failure; tells the operator to fall back to
     the two-instance topology if triggered.

  3. Stage-3 plan-doc stanza pointing at the skill + script, scoped
     specifically to the LoRA-self variant (external-API-opponent path
     is unaffected and needs no pre-flight).

Motivated by vllm-project/vllm issues 18372, 33791, 10898, 10062,
10617, 7977 surfaced during feasibility research. The pattern is
architecturally supported (NeMo-Aligner ships it for DPO/IPO; vLLM
docs document it) but under-exercised in prime-rl specifically.

Not a behavior change. No test additions -- the script itself IS the
test, gated behind live GPUs which aren't available from CI.

* chore: bump verifiers pin 42a965e -> 35826af (PR #4 squash)

Picks up the agent_bindings_fn feature from joanvelja/verifiers#4:
state-aware per-member (client, model) routing on MultiAgentEnv,
gpqa_debate external-opponent branch with learner_seat policy + pin,
shared-vLLM / LoRA-self topology support, runtime bindings validation.

Unblocks Task #11 (prime-rl learner_seat MemberRollout filter) to start
reading output.info["learner_seat"] set by the env-pack.

* feat(orchestrator): filter MemberRollouts by learner_seat

Stage 7 of the external-opponent debate pipeline. The verifiers-side
(PR #4) stamps info.learner_seat per row when opponent_model is set;
this side filters the fan-out so the frozen opponent's and judge's
trajectories never reach the trainer.

Changes:

1. fan_out_for_multi_agent gains `filter_by_learner_seat: bool = False`.
   When True, reads rollout.info['learner_seat'] and keeps only that
   member's unit. Missing info.learner_seat raises -- enabling the
   filter on a self-play env is a config mismatch, not a silent no-op.

2. MultiAgentConfig.filter_by_learner_seat: bool = False (new). Described
   in Pydantic Field so the TOML comment is auto-generated.

3. Orchestrator threads the knob into the fan-out call and the startup
   log line. No new validation gate -- the fan-out's runtime raise
   already fails loud on misconfigured envs.

4. Two new tests mirroring the existing drop_judge pair: filter=True
   keeps only the seated member; filter=True + missing info raises.

5. configs/gpqa_debate/rl_external_opponent.toml -- runnable config
   for the two-server topology (learner on orchestrator vLLM, opponent
   + judge on api.openai.com). Eval pins seat A for determinism across
   checkpoints. Comments at top point at the LoRA-self variant and the
   preflight smoke it requires.

Cannot run tests locally (prime-rl lockfile is Linux-only); CI will.
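
A minimal sketch of the seat filter inside the fan-out (surrounding fan-out code assumed):

    def filter_units_by_learner_seat(rollout, units: list) -> list:
        seat = rollout.info.get("learner_seat")
        if seat is None:
            # Enabling the filter on a self-play env is a config
            # mismatch, not a silent no-op -- fail loud.
            raise ValueError("filter_by_learner_seat=True but "
                             "info['learner_seat'] is missing")
        return [u for u in units if u["member_id"] == seat]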

* fix(orchestrator): address two Codex P1s on MA path

Two real bugs surfaced by Codex review of the MA fan-out path:

1. Custom advantage in MA mode silently corrupts gradients.
   The validation at line 225 correctly rejected advantage.type='default'
   for MA envs with the exact reasoning that compute_advantages' fixed-
   size reshape mixes seats/episodes under fan-out interleaving -- but
   allowed advantage.type='custom' through to the same broken code path.
   Same latent hazard for advantage=None. Tighten to "MA requires
   ema_per_member"; delete the dead else branch that would have called
   compute_advantages on the interleaved fan-out list.

2. Training-usage billing overstated by filtered-unit tokens.
   The MA fan-out refactor split "produce samples" from "filter samples":
   apply_filters marks unit['is_filtered'] without removing the unit,
   process_unit still returns samples for filtered units, and the
   accumulation loop tallied their tokens into num_prefill_tokens /
   num_decode_tokens before the train_examples.append gate. Those
   totals feed usage_reporter.report_training_usage(usage_type="training",
   tokens=...), so filtered rollouts were billing training that never
   happened. Gate token accumulation on is_filtered; leave
   rollout_total_samples alone since that's a "samples generated" count,
   which correctly includes filtered.

Behavior changes on intended configs: none -- no recipe in-tree uses
custom+MA, and the filtered-token undercount moves the billing number
toward the truth, not away.

---------

Co-authored-by: samsja <55492238+samsja@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Mika Senghaas <mail@mikasenghaas.de>
Co-authored-by: hallerite <git@hallerite.com>
Co-authored-by: JannikSt <JannikSt@users.noreply.github.com>
Co-authored-by: Matej Sirovatka <54212263+S1ro1@users.noreply.github.com>
Co-authored-by: rasdani <73563550+rasdani@users.noreply.github.com>
Co-authored-by: Jupiter <jupiterz@umich.edu>
Co-authored-by: Dominik <me@dominikscherm.de>
Co-authored-by: sami jaghouar <sami@primeintellect.ai>